[TRTLLM-12347][feat] enable VSA in VisualGen by o-stoner · Pull Request #14280 · NVIDIA/TensorRT-LLM

o-stoner · 2026-05-19T00:36:02Z

Summary by CodeRabbit

New Features
- Added Video Sparse Attention (VSA) algorithm for visual generation models, enabling efficient sparse attention computation on Blackwell GPUs for supported Wan pipelines.
- Introduced flow_shift parameter to override scheduler configuration in Wan pipelines during inference (this allows an apples-to-apples quality comparison with FastVideo, where the Wan pipelines use different flow_shift values than the scheduler's defaults).
- Added VideoSparseAttentionConfig for controlling VSA sparsity levels.
Enhancements
- Extended attention mechanisms to support configurable gate tensors for fine-grained attention control.
- Improved sparse attention validation across multiple pipeline variants (FLUX, FLUX.2, LTX-2, Wan).

Description

Adds VSA attention backend for TRT-LLM VisualGen based on the following VSA paper. Integrates the B200 CuteDSL kernel here. Currently, this backend is supported for Wan 2.1 using the following fine-tuned model from FastVideo. This support will be extended to Wan 2.2 T2V 14B / TI2V 5B once ModelOpt fine-tuned weights are ready.

Quality/perf findings are summarized on the page here, and quality results against H200 FastVideo with the same input noisy latent and flow_shift value are summarized here.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

o-stoner · 2026-05-19T00:38:05Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-05-19T00:43:23Z

PR_Github #49010 [ run ] triggered by Bot. Commit: bc6138d Link to invocation

tensorrt-cicd · 2026-05-19T02:35:45Z

PR_Github #49010 [ run ] completed with state SUCCESS. Commit: bc6138d
/LLM/main/L0_MergeRequest_PR pipeline #38748 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

o-stoner · 2026-05-20T20:20:27Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-05-20T20:26:20Z

PR_Github #49483 [ run ] triggered by Bot. Commit: 9cc9858 Link to invocation

tensorrt-cicd · 2026-05-20T22:58:45Z

PR_Github #49483 [ run ] completed with state SUCCESS. Commit: 9cc9858
/LLM/main/L0_MergeRequest_PR pipeline #39123 completed with status: 'FAILURE'

xrq-phys · 2026-05-25T17:31:36Z

Suggested restructuring: reuse the CUTEDSL backend (from PR #13721) and split VSA-specific branches into a sparsity sub-config + kernel sub-directory

Hi, @o-stoner! I checked in with @zhenhuaw-me today and we'd like to coordinate the VSA integration so it composes with the CuTe-DSL backend that #13721 is about to land. Below is a concrete restructuring proposal; happy to discuss alternatives if any of these don't fit your kernel's constraints.

Context (what PR #13721 brings):

A new AttentionConfig.backend = "CUTEDSL" choice, served by tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py.
A directory tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/attention/ holding the packaged cubins (cubins/) and a thin Python runner (fmha.py) that resolves and launches them.
A new optional AttentionConfig.quant_attention_config sub-config (Optional[QuantAttentionConfig]) that turns on QK16PV8 (and, on TRTLLM, SAGE). Backend init reads it; absence means "default behavior".

Requested changes for #14280:

Drop the new "VSA" backend literal; reuse "CUTEDSL".
Given that VSA is a "CuTe DSL attention kernel with sparsity", it can be considered as the same backend. Concretely:
- In config.py (after CuTe DSL lands), VSA does not need to extend backend's Literal[...] set — it'll already include "CUTEDSL".
- Drop VSAAttentionBackend as a separate class registered in get_visual_gen_attention_backend. The factory keeps returning CuTeDSLAttention for "CUTEDSL"; that class (implemented in tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py) becomes the dispatcher between quantized attention and sparse attention.
Move the VSA CuTe DSL kernel source to tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/.
Instead of landing the kernels in cute_dsl_kernels/video_sparse_attention/, a convention like cute_dsl_kernels/blackwell/<feature>/ would be better (see tensorrt_llm/_torch/cute_dsl_kernels/blackwell/ for the LLM-side directory structure), so:
- cute_dsl_kernels/blackwell/attention/ — packaged cubins + runner for dense / QK16PV8 (PR 13721)
- cute_dsl_kernels/blackwell/video_sparse_attention/ — VSA JIT source + interface (this PR)

Add an Optional[SparseAttentionConfig] field on AttentionConfig next to quant_attention_config.
Mirror the pattern PR 13721 introduced: a sub-config that is None by default; setting it switches the backend into the sparse path. Something like:

class SparseAttentionConfig(StrictBaseModel):
    """Sparse-attention recipe (CUTEDSL backend / VSA only)."""
    vsa_sparsity: float = Field(0.875, ge=0.0, le=1.0, ...)
    skip_softmax_threshold: float = Field(0.0, ge=0.0)

class AttentionConfig(StrictBaseModel):
    backend: Literal["VANILLA", "TRTLLM", "FA4", "CUTEDSL"] = ...
    quant_attention_config: Optional[QuantAttentionConfig] = None     # from #13721
    sparse_attention_config: Optional[SparseAttentionConfig] = None   # new in this PR
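
The sub-config pattern proposed above can be tried out without the repository's classes. The following is a minimal sketch that uses standard-library dataclasses as a stand-in for the Pydantic `StrictBaseModel` types. The `__post_init__` checks, the empty `QuantAttentionConfig` placeholder, and the `"VANILLA"` default are assumptions made for illustration and are not taken from either PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuantAttentionConfig:
    """Placeholder for the sub-config from #13721; its fields are not shown in this thread."""


@dataclass(frozen=True)
class SparseAttentionConfig:
    """Sparse-attention recipe (stand-in for the proposed Pydantic model)."""

    vsa_sparsity: float = 0.875
    skip_softmax_threshold: float = 0.0

    def __post_init__(self) -> None:
        # Mirrors the ge/le bounds from the proposed Field(...) declarations.
        if not 0.0 <= self.vsa_sparsity <= 1.0:
            raise ValueError("vsa_sparsity must be in [0.0, 1.0]")
        if self.skip_softmax_threshold < 0.0:
            raise ValueError("skip_softmax_threshold must be >= 0.0")


@dataclass(frozen=True)
class AttentionConfig:
    backend: str = "VANILLA"  # assumed default; the proposal leaves it elided
    quant_attention_config: Optional[QuantAttentionConfig] = None
    sparse_attention_config: Optional[SparseAttentionConfig] = None

    def __post_init__(self) -> None:
        # Assumed rule: at most one of the two sub-configs may be set.
        if self.quant_attention_config is not None and self.sparse_attention_config is not None:
            raise ValueError("quant and sparse attention configs are mutually exclusive")
        # Assumed rule: the sparse path is only served by the CUTEDSL backend.
        if self.sparse_attention_config is not None and self.backend != "CUTEDSL":
            raise ValueError("sparse_attention_config requires backend='CUTEDSL'")
```

With this shape, `AttentionConfig(backend="CUTEDSL", sparse_attention_config=SparseAttentionConfig())` selects the sparse path, while leaving both sub-configs as `None` keeps the default dense behavior.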

Move the VSA caller-side wrapper into cute_dsl.py; let CuTeDSLAttention dispatch between the two CuTe DSL kernels.
Sketch (pseudocode, names are negotiable):

# attention_backend/cute_dsl.py
class CuTeDSLAttention(AttentionBackend):
    def __init__(self, ..., quant_attention_config=None, sparse_attention_config=None, **kw):
        # Mutually exclusive: at most one of quant_/sparse_attention_config is set
        # (a config-level validator will enforce this).
        self.quant_attention_config = quant_attention_config
        self.sparse_attention_config = sparse_attention_config
        ...

    def forward(self, q, k, v, *, gate_compress=None, gate_fine=None, **kw):
        if self.sparse_attention_config is not None:
            # VSA path: tile / coarse-pool / topk / block-sparse JIT kernel
            return self._forward_vsa(q, k, v, gate_compress=gate_compress, gate_fine=gate_fine, **kw)
        # Standard dense / QK16PV8 path: packaged cubins
        return self._forward_dense(q, k, v, **kw)

_forward_dense is what cute_dsl.py already does in PR 13721 (calls into cute_dsl_kernels/blackwell/attention/).
_forward_vsa holds today's VSAAttentionBackend.forward body and calls into cute_dsl_kernels/blackwell/video_sparse_attention/.
The tiling / metadata builder (VSAMetadata, VSAMetadataBuilder, set_vsa_forward_context) can stay in cute_dsl.py alongside the dispatcher, or live in cute_dsl_kernels/blackwell/video_sparse_attention/interface.py and be imported lazily at call time.

This way, callers see exactly one "CUTEDSL" backend; the sparse vs dense decision is data-driven (presence of sparse_attention_config) rather than a third backend name.

Let me know what you think — or if any of these conflict with constraints I'm missing (e.g., kernel availability, config, etc.).

o-stoner · 2026-06-01T23:45:41Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-01T23:52:44Z

PR_Github #51445 [ run ] triggered by Bot. Commit: 8e3e4a9 Link to invocation

tensorrt-cicd · 2026-06-02T06:57:18Z

PR_Github #51445 [ run ] completed with state SUCCESS. Commit: 8e3e4a9
/LLM/main/L0_MergeRequest_PR pipeline #40853 completed with status: 'FAILURE'

o-stoner · 2026-06-02T16:50:33Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-02T16:57:05Z

PR_Github #51646 [ run ] triggered by Bot. Commit: 602e090 Link to invocation

coderabbitai · 2026-06-02T17:04:08Z

Walkthrough

This pull request adds comprehensive Video Sparse Attention (VSA) support to TensorRT-LLM's visual generation framework for Blackwell GPUs. It includes a new CUTE DSL persistent kernel with custom scheduler and PTX primitives, integration into CuTeDSLAttention and distributed attention backends, Wan pipeline orchestration with per-step metadata building, and extensive test coverage validating correctness, equivalence, performance, and multi-GPU distributed execution.

Changes

VSA Configuration and Type System

Layer / File(s)	Summary
VideoSparseAttentionConfig and sparse attention discriminator `tensorrt_llm/visual_gen/sparse_attention.py`, `tensorrt_llm/visual_gen/args.py`, `tensorrt_llm/visual_gen/__init__.py`	Adds `VideoSparseAttentionConfig` Pydantic type with `vsa_sparsity` (0.0-1.0) parameter, updates `SparseAttentionConfig` union to discriminate between skip-softmax and vsa algorithms, and introduces `AttentionConfig` validators enforcing backend-algorithm compatibility and mutual exclusivity of quant and sparse configs.
Flow shift scheduler override parameter `tensorrt_llm/visual_gen/params.py`	Adds optional `flow_shift` field to `VisualGenParams` to override scheduler's flow-matching shift per-pipeline-variant.

CuTe DSL Persistent Kernel Implementation

Layer / File(s)	Summary
Static persistent tile scheduler for VSA `tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/scheduler.py`	Implements `WorkTileInfo`, `ParamsBase`, `TileSchedulerParams`, and `StaticPersistentScheduler` for managing 3D tile space (blocks × heads × batches) scheduling and persistent work distribution across SM blocks with divmod-based tile-to-coordinate mapping.
Blackwell PTX-backed math and atomic operations `tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/ptx.py`	Defines warp reductions (`warp_reduction_fmax`), atomic ops (shared/global `atomicAdd_f32`/`atomicMax_f32`), and exp2 emulation via polynomial evaluation and PTX inline assembly for Blackwell-specific float math.
VideoSparseAttentionForwardGroup2QInterleaveKV persistent kernel `tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/block_sparse_attn_dsl_fwd.py`	Implements multi-stage CUTE DSL kernel with load (TMA Q/K/V streaming), MMA (block-sparse QK GEMM), softmax (masked running max/sum), correction (LSE + rescaling), and epilogue (TMA O writeback) stages coordinated via warpgroup pipelines and barriers.
CuTe compilation cache and interface `tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/interface.py`	Exports `is_cute_supported()` capability gating, `block_sparse_attn_from_indices_cute()` kernel entry with per-shape JIT compilation cache, and `CUTE_AVAILABLE` flag for graceful fallback when CuTe/CUDA dependencies are unavailable.
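
To make the "divmod-based tile-to-coordinate mapping" in the scheduler row concrete, here is a small host-side sketch in plain Python. It is not the kernel's scheduler: the function names are hypothetical, and the choice of which dimension varies fastest (blocks, then heads, then batches) is an assumption.

```python
from typing import Iterator, Tuple


def tile_to_coord(tile_idx: int, num_blocks: int, num_heads: int) -> Tuple[int, int, int]:
    """Map a linear tile index to (block, head, batch).

    Assumed ordering: the block index varies fastest, then head, then batch.
    """
    rest, block = divmod(tile_idx, num_blocks)
    batch, head = divmod(rest, num_heads)
    return block, head, batch


def persistent_tiles(
    cta_idx: int, num_ctas: int, num_blocks: int, num_heads: int, num_batches: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield the tiles one persistent CTA would process under static striding."""
    total_tiles = num_blocks * num_heads * num_batches
    # Each CTA starts at its own index and advances by the grid size, so the
    # whole tile space is covered once with no coordination between CTAs.
    for tile_idx in range(cta_idx, total_tiles, num_ctas):
        yield tile_to_coord(tile_idx, num_blocks, num_heads)
```

Static striding keeps the assignment deterministic; it does not balance load when tiles differ in cost, which is one reason a sparse kernel might prefer a different policy.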

Backend and Module Integration

Layer / File(s)	Summary
CuTeDSLAttention VSA sparse execution path `tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py`, `tensorrt_llm/_torch/visual_gen/attention_backend/__init__.py`	Routes `forward()` through VSA-specific `_forward_vsa()` when `sparse_attention_config` is set; manages per-shape `VSAMetadata` caching, forward-context stack, tiling/partitioning, coarse cube selection via softmaxed pooling, optional CuTe kernel dispatch with dense SDPA fallback, and gated combination of coarse/fine outputs. Relaxes head-dim constraint to allow non-128 dimensions via runtime fallback.
Attention module VSA gate routing and backend selection `tensorrt_llm/_torch/visual_gen/modules/attention.py`	Routes `gate_compress`/`gate_fine` from caller layout into backend's expected 4D layout with optional HND-layout transpose; implements VSA backend selection for `SEPARATE_QKV` mode, validates VSA incompatibility with Attention2D, and forwards gate kwargs to backend.
Ulysses distributed gate tensor handling `tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py`	Transforms `gate_compress`/`gate_fine` via `all_to_all_4d` alongside Q/K/V to maintain correct post-A2A sharding layout and applies same transposition rules as inner backend expects.
Backend factory and config wiring `tensorrt_llm/_torch/visual_gen/attention_backend/utils.py`	Conditionally passes `attention_config.sparse_attention_config` into CUTEDSL backend kwargs to enable VSA configuration dispatch.
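
The coarse stage mentioned in the `cute_dsl.py` row (selecting cubes from softmaxed pooled scores) can be illustrated with a pure-Python sketch. It assumes the pooled query-cube by key-cube scores are already computed, and it assumes the number of kept cubes is a `(1 - vsa_sparsity)` fraction with a floor of one; the actual rule and all helper names in the PR may differ.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of coarse scores."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def num_cubes_to_keep(num_kv_cubes: int, sparsity: float) -> int:
    # Assumed rule: keep a (1 - sparsity) fraction of key cubes, at least one.
    return max(1, round((1.0 - sparsity) * num_kv_cubes))


def select_cubes(coarse_scores: List[List[float]], sparsity: float) -> List[List[int]]:
    """For each query cube, return sorted indices of the key cubes it attends to."""
    selected = []
    for row in coarse_scores:
        probs = softmax(row)
        keep = num_cubes_to_keep(len(row), sparsity)
        # Rank key cubes by coarse attention probability and keep the top ones.
        top = sorted(range(len(row)), key=lambda j: probs[j], reverse=True)[:keep]
        selected.append(sorted(top))
    return selected
```

At `sparsity=0.0` every key cube is kept, which matches the test table's description of VSA collapsing to dense attention at that setting.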

Wan Pipeline VSA Orchestration

Layer / File(s)	Summary
Wan pipeline VSA metadata building and forward context `tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py`	Computes `VSAMetadataBuilder` once per `forward()` call, builds per-step metadata during denoising loop using current timestep, latent shape, and pacing parameters, and wraps transformer forward in `set_vsa_forward_context()` to make metadata available to attention layers.
Flow shift scheduler override mechanism `tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py`, `tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py`	Accepts optional `flow_shift` parameter in pipeline `forward()` and `infer()`, detects applicable shift key in scheduler config (`shift` or `flow_shift`), logs override, and updates via `register_to_config()` before `set_timesteps()`.
Wan transformer block VSA gate projections `tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py`	Conditionally creates `to_gate_compress` and `to_gate_fine` Linear projections in `WanBlock` when VSA is active; computes gate tensors from normalized hidden state during forward and forwards to self-attention via kwargs.
Pipeline variant VSA support validation `tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.py`, `tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.py`, `tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py`, `tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py`	Validates `sparse_attention_config.algorithm` in the pipeline variants that do not support VSA and rejects it there with an informative error stating that VSA is restricted to Wan 2.1 T2V 14B (720P).
Pipeline loader VSA logging `tensorrt_llm/_torch/visual_gen/pipeline_loader.py`	Logs detailed VSA backend info including CUTE kernel availability and sparsity when CUTEDSL with VSA is enabled.

VSA Test Coverage

Layer / File(s)	Summary
CuTe kernel and VSA correctness tests `tests/unittest/_torch/visual_gen/test_attention_cute_dsl_vsa.py`	Validates VSA configuration (cross-attention VANILLA fallback, Attention2D incompatibility, sparsity collapse to dense at 0.0), tile/untile round-trip with padding verification, and CuTe kernel matching against dense `scaled_dot_product_attention` and masked fp32 reference implementations.
VSA integration and equivalence tests `tests/unittest/_torch/visual_gen/test_attention_integration.py`	Validates integrated VSA self-attention equivalence to naive dense SDPA at `sparsity=0.0` and verifies output finiteness (no NaN/Inf) across multiple sparsity values.
VSA performance benchmarks `tests/unittest/_torch/visual_gen/test_attention_perf.py`	Benchmarks VSA module-level vs VANILLA backend on Wan 2.2 T2V 14B production shapes across multiple sparsity values and compares VSA fine-stage kernel directly against FlashAttention 4 performance.
Multi-GPU Ulysses + VSA distributed validation `tests/unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py`	Distributed test harness validating Ulysses + VSA forward pass shape/finiteness and correctness via comparison against single-GPU reference with cosine similarity and tolerance-based assertions.
Test configuration and registry `tests/integration/test_lists/test-db/l0_b200.yml`	Registers VSA test into l0_b200 CI configuration.
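
The multi-GPU row above describes comparing a distributed output against a single-GPU reference with cosine similarity and tolerance-based assertions. A minimal sketch of that style of check on flat lists of floats follows; the helper names and the threshold values are hypothetical and are not the ones used in `test_wan_vsa_ulysses.py`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def check_against_reference(
    out: Sequence[float],
    ref: Sequence[float],
    min_cos: float = 0.99,  # hypothetical threshold
    atol: float = 1e-2,  # hypothetical threshold
) -> float:
    """Raise AssertionError unless `out` is finite and close to `ref`."""
    if len(out) != len(ref):
        raise AssertionError(f"shape mismatch: {len(out)} vs {len(ref)}")
    if not all(math.isfinite(x) for x in out):
        raise AssertionError("output contains NaN or Inf")
    cos = cosine_similarity(out, ref)
    max_abs_err = max(abs(x - y) for x, y in zip(out, ref))
    if cos < min_cos:
        raise AssertionError(f"cosine similarity {cos:.6f} below {min_cos}")
    if max_abs_err > atol:
        raise AssertionError(f"max abs error {max_abs_err:.6f} above {atol}")
    return cos
```

Checking both metrics is useful because cosine similarity ignores a uniform scale error, while an absolute tolerance alone can hide a direction error in small-magnitude outputs.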

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

VisualGen

Suggested reviewers

Shixiaowei02
kaiyux
chang-l
Funatiq

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is substantially incomplete. While it mentions the feature being added and references external resources, it lacks specific details about the implementation, architectural changes, and does not provide any test coverage information despite the template section being present.	Add a detailed 'Description' section explaining what VSA is, how it integrates with VisualGen, key architectural decisions, and the scope of backend support. Complete the 'Test Coverage' section by listing specific test files and test cases (e.g., test_attention_cute_dsl_vsa.py, test_wan_vsa_ulysses.py) that validate the VSA implementation.
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title clearly summarizes the main change: enabling VSA (Video Sparse Attention) in VisualGen, which is the core objective of this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py`:
- Around line 201-216: The module-global _vsa_forward_context must be replaced
with a request-local contextvar to avoid cross-request clobbering: create a
contextvars.ContextVar[Optional[VSAMetadata]] (e.g. _vsa_forward_context_var)
and update set_vsa_forward_context to set the ContextVar and yield while
storing/resetting the returned token on exit, and update get_vsa_forward_context
to return _vsa_forward_context_var.get(None); keep the same function/class names
(set_vsa_forward_context, get_vsa_forward_context, VSAMetadata,
_vsa_forward_context -> _vsa_forward_context_var) so callers don’t change.
- Around line 527-541: The CuTe branch currently asserts when num_cubes exceeds
VSA_KERNEL_MAX_CUBES; instead modify the gating so the code falls back to dense
SDPA: include the condition num_cubes <= VSA_KERNEL_MAX_CUBES in the computation
of use_cute (the boolean used to choose the CuTe kernel), and remove or replace
the subsequent assert in the CuTe branch (the block referencing
VSA_KERNEL_MAX_CUBES and num_cubes) so oversized inputs simply skip CuTe and use
the existing dense fallback.
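
The first item above describes a token-based set/reset pattern on a `ContextVar`. A minimal standalone sketch of that pattern is shown here, using `Any` in place of `VSAMetadata` so that it runs outside the repository; it illustrates the suggestion and is not the code that was merged.

```python
import contextvars
from contextlib import contextmanager
from typing import Any, Iterator, Optional

# Context-local storage: each thread or asyncio task sees its own value,
# so concurrent requests cannot overwrite each other's metadata.
_vsa_forward_context_var: contextvars.ContextVar[Optional[Any]] = contextvars.ContextVar(
    "vsa_forward_context", default=None
)


@contextmanager
def set_vsa_forward_context(metadata: Any) -> Iterator[None]:
    """Make `metadata` visible to attention layers for the duration of the block."""
    token = _vsa_forward_context_var.set(metadata)
    try:
        yield
    finally:
        # Restore whatever value was active before, including nested contexts.
        _vsa_forward_context_var.reset(token)


def get_vsa_forward_context() -> Optional[Any]:
    """Return the metadata for the current context, or None if none is active."""
    return _vsa_forward_context_var.get()
```

Resetting with the token, rather than setting the variable back to `None`, is what keeps nested `with` blocks correct.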

In
`@tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/block_sparse_attn_dsl_fwd.py`:
- Line 416: The hardcoded limit self.max_indices = 4 * 1024 can be exceeded by
variable_block_sizes, causing a shared-memory overflow when copying into
sVariable_block_sizes; add a runtime validation that
variable_block_sizes.shape[0] <= self.max_indices before the copy (or
assert/raise a clear error) and fail fast with a descriptive message, and/or
enforce the check earlier in block_sparse_attn_from_indices_cute in interface.py
so callers cannot pass larger arrays; update any related docs/comments to state
the max_indices constraint and reference the symbols max_indices,
sVariable_block_sizes, variable_block_sizes, and
block_sparse_attn_from_indices_cute when making the change.

In
`@tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/ptx.py`:
- Around line 144-160: The inline assembly in the else branch currently emits a
shared-scope atomic ("atom.relaxed.shared::cta.cta.max.s32") but this path
targets global memory; update the asm string in the llvm.inline_asm call to use
the global scope ("atom.relaxed.global::cta.cta.max.s32") while keeping the same
operand ($0) and constraints, i.e., modify the asm literal passed to
llvm.inline_asm (the triple-quoted string) to replace "shared" with "global" so
the global-memory atomic is emitted.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py`:
- Around line 542-555: The code mutates the shared scheduler config when
applying a user flow_shift override (variables: flow_shift, sched_cfg,
shift_key, self.scheduler.register_to_config), which makes the change persist
across requests; instead, apply the override only request-scoped by either
restoring the original sched_cfg[shift_key] after the request or by creating a
request-local copy of the scheduler/config before calling set_timesteps();
specifically, capture the original value (orig_shift =
sched_cfg.get(shift_key)), call register_to_config only on a
cloned/configured-local scheduler or restore orig_shift via
register_to_config(**{shift_key: orig_shift}) after completing the request so
the shared scheduler config is not permanently mutated.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py`:
- Around line 489-503: The current code calls
self.scheduler.register_to_config(...) to apply a per-request flow_shift, which
mutates the shared scheduler and leaks that override to subsequent requests;
instead avoid mutating the shared scheduler by creating a request-local
scheduler/config or restoring the original value after use: either clone the
scheduler or its config (e.g., copy sched_cfg = dict(self.scheduler.config) and
apply the flow_shift to that local config or instantiate a shallow copy of the
scheduler) and use that local scheduler/config before calling set_timesteps(),
or if you must modify self.scheduler temporarily, capture the original
sched_cfg[shift_key] first and restore it immediately after the request
completes; reference flow_shift, sched_cfg, self.scheduler, register_to_config,
and set_timesteps when making the change.
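
The "capture and restore" option from the two flow_shift comments above can be sketched as a context manager. `FakeScheduler` is a hypothetical stand-in that exposes only a `config` mapping and `register_to_config(**kwargs)`; the real scheduler class and its config type are not shown in this thread, so treat the details as assumptions.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


class FakeScheduler:
    """Hypothetical stand-in with only the two members the comments mention."""

    def __init__(self, **config: float) -> None:
        self.config: Dict[str, float] = dict(config)

    def register_to_config(self, **kwargs: float) -> None:
        self.config.update(kwargs)


@contextmanager
def flow_shift_override(scheduler: FakeScheduler, flow_shift: Optional[float]) -> Iterator[None]:
    """Apply `flow_shift` for the duration of one request, then restore the original."""
    # Detect which key this scheduler uses for the shift, if any.
    shift_key = next((k for k in ("shift", "flow_shift") if k in scheduler.config), None)
    if flow_shift is None or shift_key is None:
        yield  # nothing to override
        return
    orig_shift = scheduler.config[shift_key]
    scheduler.register_to_config(**{shift_key: flow_shift})
    try:
        yield
    finally:
        # Restore even if the request raised, so the override never leaks.
        scheduler.register_to_config(**{shift_key: orig_shift})
```

A call site would wrap `set_timesteps()` and the denoising loop in `with flow_shift_override(self.scheduler, flow_shift):`. Restoring is only safe when requests on a shared scheduler do not overlap; with concurrent requests, the request-local copy suggested in the comments is the safer choice.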

In `@tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py`:
- Around line 369-390: The new VSA gate projections to_gate_compress and
to_gate_fine are created as full-width dense linears on every rank, which
misaligns with attn1's TP-local Q shards and duplicates work when tp_size>1;
change their construction to the same column-parallel/sharded setup used for the
Q projection (i.e., mirror the Q Linear creation parameters: use the same
mapping/partitioning, quant_config, skip_create_weights_in_init,
force_dynamic_quantization, and out-dim q_dim) so each rank only holds its
TP-local slice and the gate tensors line up with attn1's local Q shard. Ensure
you reference and reuse the same sharding/mapping pattern used when creating the
Q projection to_gate (or whichever variable constructs Q) so topology and sizes
match across ranks.

In `@tensorrt_llm/_torch/visual_gen/modules/attention.py`:
- Around line 471-475: The _reshape_gate helper reshapes gate tensors using the
global self.num_attention_heads which desyncs under tensor-parallelism; update
_reshape_gate (used for gate_compress / gate_fine) to compute the head count
from the incoming gate tensor (or use the same local head count used when
reshaping q/k/v) instead of self.num_attention_heads, then apply view/transpose
logic with that derived local_head_count so the final layout matches the
attention tensors and respects backend_layout (AttentionTensorLayout.HND)
handling.

In `@tests/unittest/_torch/visual_gen/test_attention_integration.py`:
- Around line 620-628: After constructing integrated (Attention(...,
config=cfg_vsa)), add an explicit assertion that the VSA path was chosen by
invoking the internal selector or flag (call integrated._build_vsa_setup() or
inspect any backend attribute set by that method) and assert it indicates
CUTEDSL/VSA; e.g., ensure the result/attribute equals the expected VSA backend
before proceeding to use integrated in the test so the test fails if CUTEDSL
silently falls back to dense.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da759100-77d8-4987-90f4-527b2030545e

📥 Commits

Reviewing files that changed from the base of the PR and between 059de9c and 602e090.

📒 Files selected for processing (26)

tensorrt_llm/_torch/visual_gen/attention_backend/__init__.py
tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py
tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py
tensorrt_llm/_torch/visual_gen/attention_backend/utils.py
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/__init__.py
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/block_sparse_attn_dsl_fwd.py
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/interface.py
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/ptx.py
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/scheduler.py
tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.py
tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.py
tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py
tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py
tensorrt_llm/_torch/visual_gen/modules/attention.py
tensorrt_llm/_torch/visual_gen/pipeline_loader.py
tensorrt_llm/visual_gen/__init__.py
tensorrt_llm/visual_gen/args.py
tensorrt_llm/visual_gen/params.py
tensorrt_llm/visual_gen/sparse_attention.py
tests/integration/test_lists/test-db/l0_b200.yml
tests/unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py
tests/unittest/_torch/visual_gen/test_attention_cute_dsl_vsa.py
tests/unittest/_torch/visual_gen/test_attention_integration.py
tests/unittest/_torch/visual_gen/test_attention_perf.py

tensorrt-cicd · 2026-06-03T02:32:55Z

PR_Github #51646 [ run ] completed with state SUCCESS. Commit: 602e090
/LLM/main/L0_MergeRequest_PR pipeline #41029 completed with status: 'FAILURE'

o-stoner · 2026-06-03T19:12:21Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-03T19:18:05Z

PR_Github #51898 [ run ] triggered by Bot. Commit: 6079294 Link to invocation

tensorrt-cicd · 2026-06-04T04:40:10Z

PR_Github #51898 [ run ] completed with state FAILURE. Commit: 6079294
/LLM/main/L0_MergeRequest_PR pipeline #41254 completed with status: 'FAILURE'

o-stoner · 2026-06-15T20:12:08Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-15T20:19:07Z

PR_Github #54358 [ run ] triggered by Bot. Commit: 29a6226 Link to invocation

tensorrt-cicd · 2026-06-16T03:30:44Z

PR_Github #54358 [ run ] completed with state FAILURE. Commit: 29a6226
/LLM/main/L0_MergeRequest_PR pipeline #43428 completed with status: 'FAILURE'

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>

o-stoner · 2026-06-16T20:21:54Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-16T20:28:32Z

PR_Github #54665 [ run ] triggered by Bot. Commit: be1916c Link to invocation

tensorrt-cicd · 2026-06-17T04:04:33Z

PR_Github #54665 [ run ] completed with state FAILURE. Commit: be1916c
/LLM/main/L0_MergeRequest_PR pipeline #43697 completed with status: 'FAILURE'

o-stoner · 2026-06-22T16:09:46Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-22T16:16:50Z

PR_Github #55050 [ run ] triggered by Bot. Commit: f7fcafe Link to invocation

o-stoner · 2026-06-22T17:43:53Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-22T17:50:34Z

PR_Github #55060 [ run ] triggered by Bot. Commit: 3730e37 Link to invocation

tensorrt-cicd · 2026-06-22T17:55:18Z

PR_Github #55050 [ run ] completed with state ABORTED. Commit: f7fcafe

Link to invocation

chang-l

CI coverage for test_wan_vsa_ulysses.py (8-GPU, cfg=2 × ulysses=4) — please confirm it actually runs before merge.

The test is collected via the unittest/_torch/visual_gen/multi_gpu directory entry in l0_dgx_b200.yml, which lives under the system_gpu_count: 8 / stage: post_merge / backend: pytorch condition. Two problems:

--add-multi-gpu-test only adds pre-merge multi-GPU stages, so it will not trigger this. Post-merge tests need /bot run --stage-list "" (or the heavy /bot run --post-merge).

Could you run python scripts/test_to_stage_mapping.py --tests "test_wan_vsa_ulysses" on this branch and confirm which stage runs it, then trigger that stage (e.g. /bot run --stage-list "") and verify it passes before merge?

tensorrt-cicd · 2026-06-23T00:47:54Z

PR_Github #55060 [ run ] completed with state FAILURE. Commit: 3730e37
/LLM/main/L0_MergeRequest_PR pipeline #44049 completed with status: 'FAILURE'

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>

Signed-off-by: o-stoner <245287810+o-stoner@users.noreply.github.com>

o-stoner · 2026-06-23T18:30:13Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-23T18:40:55Z

PR_Github #55315 [ run ] triggered by Bot. Commit: 51ed5c5 Link to invocation

o-stoner · 2026-06-23T18:42:17Z

> CI coverage for test_wan_vsa_ulysses.py (8-GPU, cfg=2 × ulysses=4): please confirm it actually runs before merge.
>
> The test is collected via the unittest/_torch/visual_gen/multi_gpu directory entry in l0_dgx_b200.yml, which lives under the system_gpu_count: 8 / stage: post_merge / backend: pytorch condition. Two problems:
>
> --add-multi-gpu-test only adds pre-merge multi-GPU stages, so it will not trigger this. Post-merge tests need /bot run --stage-list "" (or the heavy /bot run --post-merge).
>
> Could you run python scripts/test_to_stage_mapping.py --tests "test_wan_vsa_ulysses" on this branch and confirm which stage runs it, then trigger that stage (e.g. /bot run --stage-list "") and verify it passes before merge?

@chang-l According to the CI report here, unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py runs under L0_Test-x86_64-Multi-GPU in the DGX_B200-8_GPUs-PyTorch-1 stage, so IIUC it is already being collected and triggered; please correct me if I am misunderstanding. It failed in the previous run because the process was killed, and I will confirm on the next run whether it passes.

tensorrt-cicd · 2026-06-24T02:16:15Z

PR_Github #55315 [ run ] completed with state FAILURE. Commit: 51ed5c5
/LLM/main/L0_MergeRequest_PR pipeline #44267 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

o-stoner · 2026-06-24T16:14:46Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-24T16:20:52Z

PR_Github #55536 [ run ] triggered by Bot. Commit: 51ed5c5 Link to invocation

tensorrt-cicd · 2026-06-24T19:14:41Z

PR_Github #55536 [ run ] completed with state SUCCESS. Commit: 51ed5c5
/LLM/main/L0_MergeRequest_PR pipeline #44463 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

o-stoner · 2026-06-24T20:36:44Z

/bot run --disable-fail-fast --add-multi-gpu-test

o-stoner · 2026-06-24T21:25:50Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-24T21:31:31Z

PR_Github #55598 [ run ] triggered by Bot. Commit: fa1764e Link to invocation

tensorrt-cicd · 2026-06-25T04:55:06Z

PR_Github #55598 [ run ] completed with state FAILURE. Commit: fa1764e
/LLM/main/L0_MergeRequest_PR pipeline #44515 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned o-stoner May 19, 2026

zhenhuaw-me reviewed May 25, 2026

Comment thread tensorrt_llm/_torch/visual_gen/config.py Outdated

zhenhuaw-me reviewed May 25, 2026

Comment thread tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/video_sparse_attention/__init__.py Outdated

o-stoner force-pushed the user/o-stoner/visual-gen-vsa branch from 9cc9858 to 22b2f5d Compare June 1, 2026 22:24

o-stoner marked this pull request as ready for review June 2, 2026 16:50

o-stoner requested review from a team as code owners June 2, 2026 16:50

o-stoner requested a review from hchings June 2, 2026 16:50

coderabbitai Bot reviewed Jun 2, 2026

o-stoner requested a review from a team as a code owner June 3, 2026 18:30

o-stoner requested a review from yuxianq June 3, 2026 18:30

o-stoner force-pushed the user/o-stoner/visual-gen-vsa branch from 6fe39d4 to 6079294 Compare June 3, 2026 18:45

chang-l reviewed Jun 3, 2026

Comment thread tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py Outdated

xxi-nv mentioned this pull request Jun 16, 2026

[TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend #14608

Merged

skip wan multi-gpu test if checkpoint unavailable

3583a9c

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>

chang-l reviewed Jun 22, 2026

fix VSA correctness test to compare CuTe-DSL vs SDPA fallback

56a9400

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>

o-stoner force-pushed the user/o-stoner/visual-gen-vsa branch from 3730e37 to 56a9400 Compare June 23, 2026 17:57

Merge branch 'main' into user/o-stoner/visual-gen-vsa

51ed5c5

Signed-off-by: o-stoner <245287810+o-stoner@users.noreply.github.com>

fix multi-GPU VSA + Ulysses test

fa1764e

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>
